Fast Discerning Repeats in DNA Sequences with a Compression Algorithm
Authors
Abstract
Long direct repeats in genomes arise from molecular duplication mechanisms such as retrotransposition, copying of genes, exon shuffling, and others. Studying them in a given sequence reveals its internal repeat structure as well as part of its evolutionary history. Moreover, detailed knowledge about the mechanisms can be gained from a systematic investigation of repeats. The problem of finding such repeats is cast as the NP-complete problem of optimally compressing a sequence by encoding its exact repeats. The repeats chosen for compression must not overlap one another, just as the repeats that result from molecular duplications do not. We present a new heuristic algorithm, Search Repeats, in which the selection of exact repeats is guided by two biologically sound criteria: their length and the absence of overlap between them. Search Repeats detects approximate repeats as clusters of exact sub-repeats and points out large insertions/deletions within them. Search Repeats takes only 3 seconds of CPU time for the genome of Haemophilus influenzae on a Sun Ultrasparc workstation.
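The two selection criteria named in the abstract, repeat length and absence of overlap, can be illustrated with a small greedy sketch. This is not the Search Repeats algorithm, whose details the abstract does not give; the function names `exact_repeats` and `select_non_overlapping` and the `min_len` parameter are hypothetical, and the quadratic scan is for illustration only and would not scale to a genome.

```python
from collections import defaultdict


def exact_repeats(seq, k):
    """Map each length-k substring that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {word: starts for word, starts in positions.items() if len(starts) > 1}


def select_non_overlapping(seq, min_len):
    """Greedily keep exact repeats, longest first, whose occurrences never overlap.

    Illustrative assumption: a repeat is kept only if at least two of its
    occurrences are still free of previously selected repeats.
    """
    used = [False] * len(seq)   # positions already covered by a chosen repeat
    chosen = []
    for k in range(len(seq) // 2, min_len - 1, -1):
        for word, starts in sorted(exact_repeats(seq, k).items()):
            free = []
            for s in starts:
                if any(used[s:s + k]):
                    continue    # collides with a longer repeat chosen earlier
                if free and s < free[-1] + k:
                    continue    # overlaps the previous occurrence of this word
                free.append(s)
            if len(free) > 1:
                for s in free:
                    used[s:s + k] = [True] * k
                chosen.append((word, free))
    return chosen


result = select_non_overlapping("GATTACACCGATTACA", min_len=4)
print(result)
```

On this toy sequence the sketch keeps the two copies of `GATTACA` and discards the shorter repeats nested inside them, which is the behaviour that the length and non-overlap criteria are meant to enforce.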
Similar resources
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...
A simple and fast DNA compressor
In this paper we describe a new DNA compression algorithm. It is well known that one of the main features of DNA sequences is that they contain substrings which are duplicated except for a few random mutations. For this reason most DNA compressors work by searching and encoding approximate repeats. We depart from this strategy by searching and encoding only exact repeats. However, we use an enc...
Detection of Significant Patterns by Compression Algorithms: the Case of Approximate Tandem Repeats in DNA Sequences. Rivals
We use compression algorithms to analyse genetic sequences. The basic idea is that a compression algorithm is associated with a property. The more a sequence is compressed by the algorithm, the more significant is the property for that sequence. Here we present an algorithm to detect a particular type of dosDNA (Defined Ordered Sequence-DN...
DNABIT Compress – Genome compression algorithm
Data compression is concerned with how information is organized in data. Efficient storage means removal of redundancy from the data being stored in the DNA molecule. Data compression algorithms remove redundancy and are used to understand biologically important molecules. We present a compression algorithm, "DNABIT Compress" for DNA sequences based on a novel algorithm of assigning binary bits...
Reference Sequence Construction for Relative Compression of Genomes
Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In...
Publication date: 1997